# README

## Description

This folder contains three code files:

1. `ShapCalc.py`: Calculates the data Shapley value of negative samples using the Monte Carlo method.
2. `SDAE.py`: Trains a Stacked Denoising Autoencoder (SDAE) for manifold learning.
3. `DBSCAN.py`: Performs density-based clustering on the samples with high data Shapley values after they have been mapped into the manifold space.

## ShapCalc.py

This script takes pre-split data as input. The entire admission dataset is divided into training and testing sets using a 90%-10% split. The training set is further divided into negative samples (`X_train0`, `y_train0`) and positive samples (`X_train1`, `y_train1`). In each pair, `X` is a 2D numpy array of shape `(n_samples, 709)` and `y` is a 1D numpy array of shape `(n_samples,)`.
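
As an illustration of the expected input layout, the following is a minimal sketch of such a split on synthetic data. The function name `split_admissions`, the random (non-stratified) shuffle, and the label convention (0 for negative, 1 for positive) are assumptions made for this example; the actual preprocessing is not part of this folder.

```python
import numpy as np

def split_admissions(X, y, test_frac=0.1, seed=0):
    """Shuffle, hold out test_frac for testing, then split training rows by label."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(round(test_frac * len(y)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train, y_train = X[train_idx], y[train_idx]
    neg = y_train == 0  # assumption: label 0 marks negative samples
    return {
        "X_train0": X_train[neg], "y_train0": y_train[neg],
        "X_train1": X_train[~neg], "y_train1": y_train[~neg],
        "X_test": X[test_idx], "y_test": y[test_idx],
    }

# Synthetic stand-in for the admission data (the real data cannot be shared).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 709))
y = rng.integers(0, 2, size=200)
parts = split_admissions(X, y)
```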

Calculating data Shapley values is time-consuming: each random Monte Carlo permutation requires training $N^-$ classifiers, where $N^-$ is the number of negative samples. Our implementation therefore uses multiprocessing to run independent Monte Carlo processes concurrently. The results, which are the marginal contributions of each negative sample for a particular permutation, are saved as files for later aggregation.
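
The per-permutation computation can be sketched as follows. This is a simplified illustration, not the code in `ShapCalc.py`: `utility` is a hypothetical callable standing in for "train a classifier on these negative samples and score it", the treatment of positive samples is an assumption, and no early truncation is applied. A toy additive utility replaces real model training so the example runs instantly.

```python
import numpy as np

def permutation_marginals(n_neg, utility, rng):
    """Marginal contribution of each negative sample for one random permutation.

    utility(indices) -> float is a hypothetical callable. In a real run it would
    train a classifier on the given negative samples (plus, by assumption, all
    positive samples) and return its score on held-out data.
    """
    perm = rng.permutation(n_neg)
    marginals = np.zeros(n_neg)
    prev = utility(perm[:0])  # baseline score with no negative samples
    for k in range(n_neg):
        cur = utility(perm[: k + 1])  # one classifier per prefix, N^- in total
        marginals[perm[k]] = cur - prev  # gain from adding sample perm[k]
        prev = cur
    return marginals

# Toy additive utility: each sample contributes a fixed amount.
weights = np.array([0.5, 0.1, 0.3, 0.1])

def toy_utility(indices):
    return float(weights[indices].sum())

m = permutation_marginals(4, toy_utility, np.random.default_rng(0))
```

Because each call depends only on its own random permutation, calls are independent of one another and can be handed to separate worker processes.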

For each Monte Carlo permutation (i.e., iteration), the script generates a file named `tmc_result_{random_string}.pkl` containing the results. After all 100,000 iterations are complete, the marginal contributions are averaged to obtain the final data Shapley values.
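
The aggregation step might look like the sketch below. It assumes that each `tmc_result_*.pkl` file holds a single 1D array of length $N^-$ with one marginal contribution per negative sample; the actual file layout may differ, and `aggregate_marginals` is a name chosen for this example.

```python
import glob
import os
import pickle
import tempfile

import numpy as np

def aggregate_marginals(result_dir):
    """Average the per-permutation arrays stored in tmc_result_*.pkl files."""
    paths = sorted(glob.glob(os.path.join(result_dir, "tmc_result_*.pkl")))
    if not paths:
        raise FileNotFoundError(f"no tmc_result_*.pkl files in {result_dir}")
    total = None
    for path in paths:
        with open(path, "rb") as f:
            # Assumption: each file stores one array of marginal contributions.
            marginals = np.asarray(pickle.load(f), dtype=float)
        total = marginals if total is None else total + marginals
    return total / len(paths)

# Demo with two tiny hand-written result files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, values in [("a1", [1.0, 0.0, 2.0]), ("b2", [3.0, 2.0, 0.0])]:
        with open(os.path.join(tmp, f"tmc_result_{name}.pkl"), "wb") as f:
            pickle.dump(np.array(values), f)
    shapley = aggregate_marginals(tmp)
```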

## SDAE.py

This script implements the Stacked Denoising Autoencoder (SDAE) using PyTorch.

The input data consists of two parts: `tmc_shapley_results.pkl` contains the aggregated data Shapley values of the negative samples from the previous step, and `coordinates` is `X_train0`. The SDAE is trained for manifold learning in an unsupervised manner, using the data Shapley values (a.k.a. temperature) together with the coordinate information.
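
For orientation, a denoising autoencoder with a stacked encoder can be sketched in PyTorch as below. Everything specific here is an assumption for illustration and may not match `SDAE.py`: the layer sizes, the Gaussian corruption, the end-to-end training (rather than greedy layer-wise pretraining), and in particular the choice to append the temperature as one extra input column next to the 709 coordinates.

```python
import torch
from torch import nn

class SDAE(nn.Module):
    """Stacked encoder/decoder trained to reconstruct clean inputs from noisy ones."""

    def __init__(self, in_dim, hidden_dims=(128, 32, 2), noise_std=0.1):
        super().__init__()
        dims = [in_dim, *hidden_dims]  # hypothetical layer sizes
        enc, dec = [], []
        for a, b in zip(dims[:-1], dims[1:]):
            enc += [nn.Linear(a, b), nn.ReLU()]
        for a, b in zip(reversed(dims[1:]), reversed(dims[:-1])):
            dec += [nn.Linear(a, b), nn.ReLU()]
        self.encoder = nn.Sequential(*enc[:-1])  # no activation on the code layer
        self.decoder = nn.Sequential(*dec[:-1])  # no activation on the output layer
        self.noise_std = noise_std

    def forward(self, x):
        # Corrupt the input with Gaussian noise during training only.
        noisy = x + self.noise_std * torch.randn_like(x) if self.training else x
        return self.decoder(self.encoder(noisy))

torch.manual_seed(0)
coords = torch.randn(64, 709)      # synthetic stand-in for X_train0
temperature = torch.rand(64, 1)    # synthetic stand-in for data Shapley values
inputs = torch.cat([coords, temperature], dim=1)  # assumption: simple concatenation

model = SDAE(in_dim=inputs.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), inputs)  # target is the clean input
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    encoded = model.encoder(inputs)  # low-dimensional manifold coordinates
```

A model trained this way could be stored with `torch.save(model.state_dict(), path)`, where `path` is whatever location suits the project.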

The trained SDAE model will be saved for future use.

## DBSCAN.py

The input for this script is `encoded_coord_temp.pkl`, which contains the encoded coordinates of the negative samples in the manifold space together with their data Shapley values. We select the 40% of samples with the highest data Shapley values and apply DBSCAN to them; the resulting clusters (hot zones) are the automatically recognized cohorts.
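
The selection step can be illustrated with a short numpy sketch. It assumes the encoded coordinates and the Shapley values are available as two aligned arrays; how they are packed inside `encoded_coord_temp.pkl` is not specified here, and `select_top_fraction` is a name chosen for this example.

```python
import numpy as np

def select_top_fraction(encoded, shapley, fraction=0.4):
    """Keep the given fraction of samples with the highest data Shapley values."""
    n_keep = int(np.ceil(fraction * len(shapley)))
    order = np.argsort(shapley)[::-1]  # indices from highest to lowest value
    keep = np.sort(order[:n_keep])     # restore original sample order
    return encoded[keep], shapley[keep], keep

# Five samples in a 2D manifold space with made-up Shapley values.
encoded = np.arange(10, dtype=float).reshape(5, 2)
shapley = np.array([0.1, 0.9, 0.5, 0.3, 0.7])
hot_coords, hot_values, keep = select_top_fraction(encoded, shapley)
```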

To select an appropriate parameter combination, we first explore and test different values of $P_{min}$. For a given $P_{min}$, we compute the distance from each high-data-Shapley-value sample to its $(P_{min}/2)$-th nearest neighbor and take the 75th percentile of these distances as $\varepsilon$. We treat regions whose local density exceeds twice the global density upper bound as high-density regions.
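
One way to read this heuristic is sketched below with plain numpy. The function name `suggest_eps`, the use of Euclidean distance, and the integer division for $P_{min}/2$ are assumptions. The brute-force pairwise distance matrix needs memory quadratic in the number of samples, so a neighbor index would be preferable for large inputs.

```python
import numpy as np

def suggest_eps(points, p_min, percentile=75.0):
    """Percentile of each point's distance to its (p_min // 2)-th nearest neighbor."""
    k = max(1, p_min // 2)
    if k >= len(points):
        raise ValueError("p_min // 2 must be smaller than the number of points")
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # Euclidean distance (assumed)
    dists.sort(axis=1)  # column 0 is each point's zero distance to itself
    kth = dists[:, k]   # distance to the k-th nearest other point
    return float(np.percentile(kth, percentile))

# Five evenly spaced points on a line, one unit apart.
pts = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
eps_small = suggest_eps(pts, p_min=2)
eps_large = suggest_eps(pts, p_min=4)
```

The resulting value could then be passed to a DBSCAN implementation, for example as `eps` in `sklearn.cluster.DBSCAN(eps=..., min_samples=...)`, assuming $P_{min}$ corresponds to `min_samples`.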

## Dataset

Our experiments are conducted on Electronic Medical Records (EMR) generated from real hospitalized patients in our clinical practice. Due to data usage protocols and patient privacy protection, we are unable to distribute any data.

